ABSTRACT

This paper reports the use of a document distance-based 
approach to automatically expand the number of available 
relevance judgements when these are limited and reduced to 
only positive judgements. This may happen, for example, 
when the only available judgements are extracted from a list 
of references in a published review paper. We compare the 
results on two document sets: OHSUMED, based on medical
research publications, and TREC-8, based on news feeds.
We show that evaluations based on these expanded relevance
judgements are more reliable than those using only the
initially available judgements, especially when the number of
available judgements is very limited. 

Categories and Subject Descriptors 

H.2.4 [Systems]: Textual databases; H.3.4 [Systems and
Software]: Performance evaluation 
1. INTRODUCTION

An important bottleneck in the development of information
retrieval (IR) systems is their evaluation. Generating
human-produced judgements is expensive and time-consuming,
and it is not always possible to produce a large set of
relevance judgements (qrels henceforth).

We envisage a scenario where the only available qrels are the 
list of references of a survey paper. For example, within the 
area of Evidence Based Medicine (EBM), clinical systematic 
reviews provide the key published evidence that is relevant 
to a specific clinical query, together with a list of references 
that backs up the clinical evidence. This list of references, 
however, covers only a small sample of all relevant
references [3]. Furthermore, only a fraction of the documents
of a systematic review can be retrieved after performing
exhaustive searches, mostly because of complex queries and
the existence of several document repositories [6]. Another
problem with using the list of references as the only qrels is
that negative qrels, that is, judgements about non-relevant
documents, are not included. Any attempt to develop IR
systems for such a scenario will need to supplement the list
of references with something else. In this paper we propose
to automatically expand the qrels by finding similar
documents.

2. RELATED WORK 

Using document distance as a criterion to expand a list of 
qrels sounds intuitive. The approach is related to the
well-known cluster hypothesis: “closely associated documents
tend to be relevant to the same requests” [9]. This hypothesis
has typically been used to improve the quality of the
retrieval of documents, but there is very limited past work
using the cluster hypothesis to improve the quality of the
evaluation.

Previous work on the expansion of an initial set of document
assessments includes the use of Machine Learning. For example,
Büttcher et al. [1] trained a classifier over a subset of qrels
in order to expand the set of qrels. They showed that evaluation
results with the expanded set of qrels had better quality than
those using the source subset of qrels. Quality of the evaluation
was measured by ranking a set of IR systems according to
the new expanded qrels, and comparing it against the system
ordering produced by the original qrels. In the clinical
domain, Martinez et al. [b] explored the use of re-ranking
methods based on reduced judgements, and found that the
use of automatic classifiers would considerably reduce
the time required for clinicians to identify a large portion
(95%) of the relevant documents. Both of these articles
reported limitations of the classifiers when the initial number
of documents was small. Furthermore, in the scenario
that we contemplate, where we rely on the list of references
of a systematic review as the set of qrels, we do not have
information about negative qrels, and therefore a classifier-based
approach to expand the set of relevant documents
would have to deal with this issue.

More recent work [8] has shown that by relying on documents
retrieved frequently by a diverse set of systems, it is
possible to build relevance assessments automatically, and
achieve high correlation with manually judged data. However,
this approach has been tested by building on a set of
competing runs from different research groups, which is not
always available; and this method does not benefit from
existing qrels.

Table 1: List of 16 runs from the Terrier package

Prior work using document distance criteria for expanding 
the qrels includes [7], which suggests that this approach may
work for a document collection within the medical domain.
In this paper we show that this approach improves the quality
of evaluation both for medical and news reports, and
we therefore add further evidence of the plausibility of this
method.

Our work complements related work on the study of
the impact of the number of topics and relevance judgements 
in IR evaluation [2]. 

3. DATA SETS 

We use the OHSUMED collection of medical research publications,
and the TREC-8 collection of news feeds.

The OHSUMED collection [4] is a corpus containing clinical
queries and assessments. We focus on the set of 63
queries that was used in the TREC-9 Filtering Track. The
OHSUMED queries were generated to address actual information
needs of clinicians, and the assessed documents were
retrieved in two iterations, by relying on the MEDLINE
search interface¹ and the SMART retrieval system respectively.
The retrieved documents were judged by a group of domain
experts separate from the group that performed the search.
As the document collection we rely on the 1988-91 subset of
MEDLINE that was released as test data for the TREC-9
challenge, which contains 293,856 documents. The judgement
set has an average of 50.87 judgements per query, all
of them positive. Since the original runs of the systems
participating in the TREC-9 challenge are not available, for
evaluation we created 16 IR systems implemented with the
Terrier 3.5 open source package Js]. Table 1 lists the settings
of the Terrier package used for our runs, which are the same
settings used by [7].

Each document of the OHSUMED collection contains bibliographical
data (title, authors, etc.) plus the abstract. For
the experiments reported in this paper we used only the 
contents of the abstract. 

The TREC-8 collection [10] comprises disks 4 and 5 of the
TREC collection, excluding the Congressional Record
subcollection. We used the test set, which has 50 queries with
an average of 1,736 qrels per query. Since we want
to model a scenario where only positive judgements are used,
we use only the positive qrels, which average 94.56
per query. The qrels were generated using the pooling
method, taking the top 100 documents retrieved by the
systems participating in the ad-hoc task of TREC-8. For
evaluation we used the results of the original systems that
participated in the ad-hoc track of TREC-8.

1 http://www.ncbi.nlm.nih.gov/pubmed

Each document of the TREC-8 collection contains XML
markup of various kinds. Given that each of the multiple sources
had a different XML tag set, for the experiments reported in this
paper we simply ignored all lines that contained XML markup.
The remaining lines consisted mostly of the main text, but
a few lines of metadata were still left.
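
The line-level filtering described above might look roughly like
the following Python sketch. The tag pattern, the function name
strip_markup_lines and the sample document (including its
identifier) are assumptions made for illustration; the paper does
not give the exact rule used to detect markup.

```python
import re

# Assumption: a "line with markup" is any line containing something that
# looks like an opening or closing tag, e.g. <DOCNO> or </TEXT>.
TAG_PATTERN = re.compile(r"</?[A-Za-z][^>]*>")


def strip_markup_lines(raw_document: str) -> str:
    """Return the document text with every tag-bearing line removed."""
    kept = [line for line in raw_document.splitlines()
            if not TAG_PATTERN.search(line)]
    return "\n".join(kept)


# Made-up document in the general style of a newswire collection.
sample = """<DOC>
<DOCNO> FT911-3 </DOCNO>
<TEXT>
Shares rose sharply in early trading.
Analysts expect further gains.
</TEXT>
</DOC>"""

cleaned = strip_markup_lines(sample)
print(cleaned)
```

Dropping whole lines is crude: any text that shares a line with a
tag is lost, while untagged metadata lines survive, which matches
the residue mentioned above.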

4. DISTANCE VERSUS RELEVANCE 

We first examined the relation between the similarity of
qrel candidates to known positive qrels and their relevance.
We obtained the candidates by pooling, as explained below for
each dataset. For
every query and for every qrel candidate in the query, we
computed the minimum distance between the qrel candidate
and a known positive qrel for the query. The resulting
(qrel candidate, query) pairs were sorted by distance and
binned into deciles such that the first decile is formed by
the top 10% of pairs, and so on. Then, within each decile we
computed the percentage of qrel candidates that were actually
positive qrels. Since the OHSUMED data only had
positive qrels, for each query we built the list of qrel candidates
by pooling the top 100 documents per run. There
was an average of 202.80 qrel candidates per query (12,371
qrel candidates in total²), and those that were not in the
list of known qrels were tagged as negative judgements. For
the TREC data, we used the qrels provided by the organisers
of TREC. These qrels had been obtained by pooling
the top 100 documents per run and contained positive and
negative judgements, with an average of 1,736.60 qrels per
query (86,830 qrels in total). Due to time and memory constraints
we used the first 100 qrels of each query, giving
a total of 5,000 qrel candidates.
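
As a rough illustration of the binning step, the following Python
sketch sorts (distance, relevance) pairs and reports the
percentage of relevant candidates per decile. It assumes that the
minimum distance of each candidate to a known positive qrel has
already been computed, and that "top" means smallest distance;
the function name and the toy data are hypothetical.

```python
from typing import List, Tuple


def relevance_by_decile(pairs: List[Tuple[float, bool]]) -> List[float]:
    """Percentage of relevant candidates in each of 10 bins.

    Each pair is (minimum distance to a known positive qrel,
    whether the candidate is actually relevant). Bins are ordered
    from the closest candidates to the farthest ones.
    """
    ordered = sorted(pairs, key=lambda pair: pair[0])
    n = len(ordered)
    percentages = []
    for decile in range(10):
        chunk = ordered[decile * n // 10:(decile + 1) * n // 10]
        if chunk:
            relevant = sum(1 for _, is_relevant in chunk if is_relevant)
            percentages.append(100.0 * relevant / len(chunk))
        else:
            # Fewer than 10 pairs can leave a bin empty.
            percentages.append(0.0)
    return percentages


# Toy data: 20 candidates, the 5 closest ones are relevant.
toy_pairs = [(0.05 * i, i < 5) for i in range(20)]
deciles = relevance_by_decile(toy_pairs)
print(deciles)
```

Pooling all (qrel candidate, query) pairs before binning, rather
than binning per query, mirrors the description above.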

Figure 1 shows the result. The figure shows a clear relation
between distance and relevance in both datasets. The relation
is not as marked as reported by [7] but, as we will show
below, it is sufficient to give an improvement in the evaluation
when we expand the original qrels. The reason why the
results differ from those of prior work is that the pool of documents
in prior work was taken from the global list of known
qrels, instead of from the runs of the systems. Our pooling
method reflects a more realistic scenario and makes it possible
to compare the OHSUMED and the TREC datasets.
We observe that, in general, the percentage of relevant candidates
drops much more quickly in the TREC data than in the
OHSUMED data.

For the experiments we used as the distance metric d(x, y) =
1 - cos(x, y), where cos(x, y) is the cosine similarity. The vector
representations were formed by obtaining the tf.idf values
of all words after lowercasing and removing stop words,
and then taking the top 200 components after performing
Principal Component Analysis (PCA).³ These are the same
settings as described by [7].
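
To make the distance concrete, here is a minimal, dependency-free
Python sketch of tf.idf vectors and the cosine distance
d(x, y) = 1 - cos(x, y). It is not the pipeline used in the
experiments: the PCA reduction to 200 components is deliberately
left out, and the stop-word list, the tokenizer and the idf
variant (log(N/df) + 1) are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Toy stop-word list (an assumption; the paper does not name its list).
STOP_WORDS = {"the", "of", "a", "in", "and", "to", "is"}


def tokenize(text: str) -> List[str]:
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Sparse tf.idf vectors, one dict per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors


def cosine_distance(x: Dict[str, float], y: Dict[str, float]) -> float:
    """d(x, y) = 1 - cos(x, y); empty vectors get the maximum distance 1."""
    dot = sum(weight * y.get(term, 0.0) for term, weight in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 1.0
    return 1.0 - dot / (norm_x * norm_y)


docs = [
    "Aspirin reduces the risk of stroke",
    "Aspirin and the risk of stroke in adults",
    "Stock markets fell in early trading",
]
vectors = tfidf_vectors(docs)
print(cosine_distance(vectors[0], vectors[1]))
print(cosine_distance(vectors[0], vectors[2]))
```

In this toy example the two medical sentences end up closer to each
other than either is to the news sentence, which shares no terms
with them.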

2 Note that the total number of qrels is slightly lower than 
63 × 202.80 ≈ 12,777 due to the existence of qrels shared among
questions. 

3 These experiments were carried out in Python and the
scikit-learn library.

Figure 1: Distance versus relevance in the OHSUMED and
TREC-8 test datasets.

4.1 Pseudo-qrels for Evaluation 

We expand the original qrels by introducing qrel candidates 
that are close enough to a known positive qrel. The specific 
process to rank the candidates is the same as described in
Section 4. We then apply a percentile threshold to select
the pseudo-qrels. In other words, given the list of pairs (qrel 
candidate, query) sorted by distance to the closest positive 
qrel of the query, we select the top K% qrel candidates. We 
will call these added qrel candidates pseudo-qrels. 

The process to find the pseudo-qrels uses a threshold that
is global to all queries. This means that some queries may
receive more pseudo-qrels than others, and a query may
receive no pseudo-qrels. As we reduce the threshold, we will
find more cases where a query has no additional pseudo-qrels.
We consider a global threshold desirable,
since if a query only has documents that are relatively far
from known qrels, it is better not to add them as pseudo-qrels.
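
The selection with a global percentile threshold described above
can be sketched in Python as follows. The sketch assumes that the
input contains only candidates that are not already known qrels,
each with its minimum distance to a known positive qrel of its
query; the function name, the data layout and the toy values are
hypothetical.

```python
from typing import Dict, List, Set, Tuple


def select_pseudo_qrels(
    scored: List[Tuple[float, str, str]], k_percent: int
) -> Dict[str, Set[str]]:
    """Keep the closest k_percent of (distance, query_id, doc_id) triples.

    The cut-off is applied to the pooled list of all queries, so a
    query whose candidates are all far away may receive nothing.
    """
    ordered = sorted(scored)
    cutoff = len(ordered) * k_percent // 100
    pseudo: Dict[str, Set[str]] = {}
    for _, query_id, doc_id in ordered[:cutoff]:
        pseudo.setdefault(query_id, set()).add(doc_id)
    return pseudo


# Toy candidates for three queries.
scored_candidates = [
    (0.10, "q1", "d1"),
    (0.15, "q1", "d2"),
    (0.40, "q2", "d3"),
    (0.80, "q2", "d4"),
    (0.90, "q3", "d5"),
]
pseudo_qrels = select_pseudo_qrels(scored_candidates, k_percent=40)
print(pseudo_qrels)
```

With K = 40 only the two closest candidates are kept, both for q1,
so q2 and q3 receive no pseudo-qrels.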

To test the impact of the number of available qrels, in our 
experiments we have varied the number of qrels per query, 
always making sure that each query had at least one qrel. 
The selected qrels were drawn randomly from the original 
set of qrels, using the same random seed in all experiments. 
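
One possible way to draw such reduced qrel sets is sketched below
in Python. It assumes that the percentage is applied per query and
rounded down, with a minimum of one qrel, and that every query has
at least one qrel to begin with; the paper does not spell out these
details, and the function name and toy data are hypothetical.

```python
import random
from typing import Dict, Set


def sample_qrels(
    qrels: Dict[str, Set[str]], percent: int, seed: int = 0
) -> Dict[str, Set[str]]:
    """Keep roughly `percent` of each query's qrels, never fewer than one."""
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    reduced: Dict[str, Set[str]] = {}
    for query_id in sorted(qrels):
        # Sorting first keeps the sample independent of set ordering.
        docs = sorted(qrels[query_id])
        keep = max(1, len(docs) * percent // 100)
        reduced[query_id] = set(rng.sample(docs, keep))
    return reduced


full_qrels = {
    "q1": {f"d{i}" for i in range(20)},
    "q2": {"d100", "d101", "d102"},
}
reduced_qrels = sample_qrels(full_qrels, percent=10, seed=42)
print({q: sorted(docs) for q, docs in reduced_qrels.items()})
```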

4.2 Correlation for ranking IR systems 

To determine the quality of the pseudo-qrels, and keeping in
mind the scenario envisaged in the introduction, we evaluate
and rank the set of runs using the qrels plus pseudo-qrels.
The evaluation metric was Mean Average Precision (MAP). We
then compare the ranking
of systems against another evaluation where we use the
complete set of qrels. The system rankings are compared
using Kendall’s tau.
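
The comparison of system rankings can be illustrated with the
following Python sketch, which computes MAP for each run under two
sets of qrels and then Kendall's tau between the resulting system
scores. It uses the simple tau-a variant (tied pairs count as
neither concordant nor discordant) and treats unjudged documents
as non-relevant; both choices, as well as the function names and
the toy runs, are assumptions rather than details taken from the
paper.

```python
from itertools import combinations
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant docs."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(
    run: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """MAP of one run (query -> ranked doc ids) over all judged queries."""
    scores = [average_precision(run.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores)


def kendall_tau(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Kendall's tau-a between two scorings of the same systems."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / n_pairs


# Toy setting: three systems, one query, complete versus reduced qrels.
complete = {"q1": {"d1", "d2", "d3"}}
reduced = {"q1": {"d1"}}
runs = {
    "sysA": {"q1": ["d1", "d2", "d3", "d9"]},
    "sysB": {"q1": ["d9", "d1", "d2", "d3"]},
    "sysC": {"q1": ["d9", "d8", "d7", "d1"]},
}
map_complete = {s: mean_average_precision(r, complete) for s, r in runs.items()}
map_reduced = {s: mean_average_precision(r, reduced) for s, r in runs.items()}
tau = kendall_tau(map_complete, map_reduced)
print(tau)
```

A tau close to 1 means that the reduced (or expanded) qrels order
the systems almost exactly as the complete qrels do.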

We conducted several experiments by varying the percentages
of qrels extended with the computed pseudo-qrels. We
also included a baseline that does not include the additional 
pseudo-qrels. The baseline simulates the default case when 
we only use the available qrels. 

Figure 2: Kendall’s tau of system orderings on the
OHSUMED data

Figure 3: Kendall’s tau of system orderings on the
TREC data

Figure 2 shows the results for the OHSUMED dataset, and
Figure 3 shows the results for the TREC dataset. The figures
present the results for varying values of K (the percentage 
of top documents selected as pseudo-qrels). We can observe, 
as expected, that larger percentages of qrels lead to higher 
correlation. 

In both cases, we observe a gain in Kendall’s tau for small
percentages K of the original qrels. The gain is higher in
the OHSUMED than in the TREC dataset. Figure 4 zooms in on
the lower values of K for the TREC data. We observe a
greater gain for some of the smaller values of K. Critically,
these values represent an original number of qrels that is
similar to those encountered in our envisaged scenario.

We observed that selecting a different subset of qrels influences
the resulting tau, especially for the smaller percentages
of qrels. We tried several baselines by using different
random seeds to select the qrels, and compared them with
the versions expanded with the pseudo-qrels. The gain of
adding pseudo-qrels varied depending on the initial choice of
qrels, but in general there was a gain. Figure 5 illustrates the
impact of using different initial qrels for the TREC dataset.

Figure 4: Kendall’s tau of system orderings focusing
on the smaller percentages of the TREC data

Figure 5: Impact of using different initial qrels. In
all cases, adding pseudo-qrels improved the results
or left them practically unchanged.

5. CONCLUSIONS 

We have compared the use of document similarity scores in 
two datasets, with the aim of compensating for the limited
availability of qrels. The advantage of our approach over
classification-based approaches such as those of prior work
is that our method is applicable even when there are only 
positive relevance judgements. 

The results are particularly encouraging when the number
of available relevance judgements is very limited, and they
suggest the use of distance-based expansion of relevance
judgements as a quick and cheap evaluation step during the
development stage of information retrieval systems when
there are few and only positive relevance judgements. The
approach can therefore be applied to the development of IR systems
that search for relevant clinical studies, even when the set
of known available relevant documents is just the list of
references of a sample clinical systematic review.

Further work includes a more comprehensive study of the 
thresholds that lead to the best evaluation setting, and the 
use of variants of distance metrics, other than straight cosine 
distance over a bag-of-words vector space model. Also, given 
that the measure of quality used in this study is based on 
the correlation of rankings with an automated evaluation 
metric, it is desirable to extend this study with real human 
judgements. 

Finally, note that the present study expands the available 
qrels with positive judgements only. A further interesting 
line of research will include the automatic addition of negative
judgements.